author: "Del Middlemiss"
date: "15th August 2019"

1 Learning Objectives

  • Understand and be able to build upon simple linear regression by adding continuous and categorical predictors.
  • Be able to investigate multiple predictor-predictor and predictor-response correlations
  • Understand and be able to use the +, * and : operators in formula notations
  • Have seen a parallel slopes model
  • Be able to add interactions between predictors and interpret their effects
  • Understand that multiple continuous predictors lead to ‘planes of best fit’
  • Be able to use conditional plots to investigate interactions between continuous predictors

New packages: mosaicData, GGally, ggiraphExtra

Duration - 120 minutes


2 Starting point - simple linear regression

We’re going to introduce the elements of multiple linear regression one-by-one to help you understand their use and effect. Our starting point will be simple linear regression with a response variable \(y\) and a continuous predictor \(x\). This is just the sort of linear regression scenario we looked at in module 2! We’ll then look at adding:

  • One or more additional categorical predictor(s)
  • One or more additional continuous predictor(s)
  • Interaction terms

2.1 The RailTrail data

Let’s have a look at a dataset with multiple predictors. The RailTrail data from the mosaicData package contains 90 daily observations of the volume of users on a ‘railtrail’ (i.e. a former rail line converted to a path), together with a number of other factors: the average daily temperature, whether that day was a weekday, precipitation, cloud cover and so on. A laser counting station gathered the user volume values, coupled to a set of weather sensors for the other factors.

We’re going to build a regression model for user volume as a function of one or more predictor variables.

library(mosaicData)
library(tidyverse)
glimpse(RailTrail)
## Observations: 90
## Variables: 11
## $ hightemp   <int> 83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41, 5…
## $ lowtemp    <int> 50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49, 3…
## $ avgtemp    <dbl> 66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67.5,…
## $ spring     <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,…
## $ summer     <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ fall       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ cloudcover <dbl> 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, 8.…
## $ precip     <dbl> 0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.00,…
## $ volume     <int> 501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 432…
## $ weekday    <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FAL…
## $ dayType    <chr> "weekday", "weekday", "weekday", "weekend", "weekday"…

Task - 3 mins

Take a few minutes to acquaint yourself with the RailTrail data:

  • perhaps try running summary()
  • plot volume against a few predictors that you think may be of interest

Solution

summary(RailTrail)
##     hightemp        lowtemp         avgtemp          spring      
##  Min.   :41.00   Min.   :19.00   Min.   :33.00   Min.   :0.0000  
##  1st Qu.:59.25   1st Qu.:38.00   1st Qu.:48.62   1st Qu.:0.0000  
##  Median :69.50   Median :44.50   Median :55.25   Median :1.0000  
##  Mean   :68.83   Mean   :46.03   Mean   :57.43   Mean   :0.5889  
##  3rd Qu.:77.75   3rd Qu.:53.75   3rd Qu.:64.50   3rd Qu.:1.0000  
##  Max.   :97.00   Max.   :72.00   Max.   :84.00   Max.   :1.0000  
##      summer            fall          cloudcover         precip       
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 3.650   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median : 6.400   Median :0.00000  
##  Mean   :0.2778   Mean   :0.1333   Mean   : 5.807   Mean   :0.09256  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.: 8.475   3rd Qu.:0.02000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :10.000   Max.   :1.49000  
##      volume       weekday          dayType         
##  Min.   :129.0   Mode :logical   Length:90         
##  1st Qu.:291.5   FALSE:28        Class :character  
##  Median :373.0   TRUE :62        Mode  :character  
##  Mean   :375.4                                     
##  3rd Qu.:451.2                                     
##  Max.   :736.0
RailTrail %>%
  ggplot(aes(x = avgtemp, y = volume)) +
  geom_point()

RailTrail %>%
  ggplot(aes(x = weekday, y = volume)) +
  geom_boxplot()

RailTrail %>%
  ggplot(aes(x = precip, y = volume)) +
  geom_point()


2.2 ggpairs() in the GGally package

Some of the predictors look redundant, so let’s trim them out. We’ll only use a single measure of temperature (say avgtemp), and we can also remove fall and dayType, as they can be computed from other predictors (i.e. they are aliases).

RailTrail_trim <- RailTrail %>%
  select(-c("hightemp", "lowtemp", "fall", "dayType"))

Now we can simply plot the trimmed dataframe to see a useful basic visualisation

plot(RailTrail_trim)

or we can obtain an enhanced version of the same plot using the ggpairs() function in the GGally package!

library(GGally)
ggpairs(RailTrail_trim)

We should examine these plots to look for predictors that appear significantly associated with our response variable (volume in this case). We should also take a keen interest in how the predictors correlate with each other.

2.3 Simple linear regression

Examining the pairs plots above, we see a reasonably strong association between the user volume each day and avgtemp, the average daily temperature (degrees Fahrenheit). Let’s run simple linear regression as follows:

  • The response will be volume
  • The predictor will be avgtemp
  • The simple linear regression equation is \(\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp}\)

We’re hypothesising that people are more likely to use the rail trail in warm weather.

RailTrail_trim %>%
  ggplot(aes(x = avgtemp, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Task - 5 mins

Just to get back in the ‘groove’, run a simple linear regression on the RailTrail data, taking volume as response and avgtemp as predictor. Check the regression assumptions. How well does this simple model perform in predicting volume?

[Hint - remember, we can check the regression assumptions by plotting the model object]

Solution

model <- lm(volume ~ avgtemp, data = RailTrail_trim)
par(mfrow = c(2, 2))
plot(model)

summary(model)
## 
## Call:
## lm(formula = volume ~ avgtemp, data = RailTrail_trim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -237.53  -80.00    8.49   55.14  326.67 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.602     63.474   1.569     0.12    
## avgtemp        4.802      1.084   4.428 2.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.9 on 88 degrees of freedom
## Multiple R-squared:  0.1822, Adjusted R-squared:  0.1729 
## F-statistic: 19.61 on 1 and 88 DF,  p-value: 2.723e-05

The regression doesn’t seem too bad, although the diagnostics show some evidence of systematic variation not captured by the model (Residuals vs Fitted plot), some deviation from normality in the upper-quantile residuals (Normal Q-Q plot), and rather mild heteroscedasticity (Scale-Location plot).

Alas, however, the model isn’t very effective. The \(r^2\) value is \(0.18\), and the residual standard error is \(115.9\). To put the latter in context, let’s see the boxplot of volume.

RailTrail_trim %>%
  ggplot(aes(y = volume)) + 
  geom_boxplot()

The median is just below \(400\) users, and our estimated volume values are accurate to only \(116\) users on average (we get this from the residual standard error)! So our estimates are out by around \(25\%\) of the typical user volume.

We can hopefully do better by adding further predictors to the model, taking us into the territory of multiple linear regression! Adding predictors might well also fix some of the rather mild breaches of the regression assumptions we have observed. Fingers crossed!

We’ll start by adding a categorical predictor variable, yielding what’s known as a parallel slopes model.


3 Parallel slopes model - adding a categorical predictor

Let’s hypothesise that weekday has an effect on usage of the rail trail. We could probably argue the effect in either direction: that weekdays increase volume due to use by people commuting to work, or perhaps that weekends are busier due to recreational use. We shall see…

Task - 2 mins

Try plotting an appropriate visualisation to determine whether user volume is associated with the weekday predictor.

Solution

Two boxplots split by weekday are appropriate here.

RailTrail_trim %>%
  ggplot(aes(x = weekday, y = volume)) + 
  geom_boxplot()

The boxplots clearly differ, so we can conclude there is an association between the two variables.

To be more specific, let’s calculate the point-biserial correlation coefficient (this is just the usual Pearson correlation coefficient applied to the case of one continuous and one dichotomous variable). Note that cor() requires numeric inputs, so we convert the logical weekday first.

RailTrail_trim %>%
  summarise(cor = cor(as.numeric(weekday), volume))
This confirms a weak negative correlation.

Systematic model development

We are not really following a process of systematic model development here. We’ll detail that process in the next lesson: for now, we are content just to discuss the ‘ingredients’ of multiple linear regression.

If we wished to be systematic about adding a predictor to our model beyond avgtemp, we would need to see which predictor is most strongly associated with the response volume after the systematic variation due to avgtemp has been subtracted. We would do this iteratively:

  1. …add the ‘most significant’ predictor to the model
  2. get the residuals of the enlarged model: the residuals are the new ‘unexplained’ variation
  3. find the next ‘most significant’ predictor of this ‘unexplained’ variation
  4. go back to step 1…

We would keep cycling in this way until we are satisfied with the model.
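As a small sketch of steps 2 and 3 on simulated data (the variables x1, x2 and y here are invented for illustration, not part of RailTrail): fit a first model, then regress its residuals on a candidate predictor to gauge how much of the ‘unexplained’ variation that predictor captures.

```r
# Sketch on simulated data: after fitting y ~ x1, regress the residuals
# on candidate predictor x2 to see how much leftover variation it explains
set.seed(42)
n  <- 90
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + 3 * x2 + rnorm(n)

m1      <- lm(y ~ x1)          # step 1: current model
m_resid <- lm(resid(m1) ~ x2)  # steps 2-3: model the 'unexplained' variation
summary(m_resid)$coefficients["x2", "Pr(>|t|)"]  # tiny p-value: add x2 next
```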


Let’s add weekday into our regression model!

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday}\]

In R’s formula notation, this is just as simple as saying volume ~ avgtemp + weekday

model2 <- lm(volume ~ avgtemp + weekday, data = RailTrail_trim)

The \(+\) operator in R’s formula notation

Note that \(+\) in this formula doesn’t mean ‘add the two variables together’ here, i.e. the regression equation is not something like

\[\widehat{\textrm{volume}} = \textrm{intercept} + b \times (\textrm{avgtemp} + \textrm{weekday})\] Instead, read the \(+\) operator in a formula as meaning ‘add predictor weekday into the model with its own coefficient’. Similarly, the \(*\) operator has a new meaning that we’ll see below (along with a new one: the \(:\) operator).


Let’s see the summary() and diagnostics of the model

summary(model2)
## 
## Call:
## lm(formula = volume ~ avgtemp + weekday, data = RailTrail_trim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -242.92  -75.72    3.15   66.90  280.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  163.022     65.428   2.492  0.01461 *  
## avgtemp        4.541      1.050   4.323 4.08e-05 ***
## weekdayTRUE  -70.320     25.566  -2.751  0.00724 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111.8 on 87 degrees of freedom
## Multiple R-squared:  0.2476, Adjusted R-squared:  0.2303 
## F-statistic: 14.32 on 2 and 87 DF,  p-value: 4.214e-06
par(mfrow = c(2,2))
plot(model2)

  • We see a mild improvement in \(r^2\) from \(0.18\) with the single predictor model, to \(0.25\) here.
  • Both predictor variables are statistically significant: their p-values are below \(0.05\) (or \(0.01\) for precise work)
  • Note how R has treated the weekday variable: it has created a single dummy variable weekdayTRUE, with weekdayFALSE absorbed into the intercept. So weekdayFALSE is being treated as the reference level.
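We can see this dummy coding directly by inspecting the design matrix with model.matrix(). A minimal sketch on a toy data frame (the values are invented, loosely mimicking RailTrail):

```r
# Toy data frame loosely mimicking RailTrail (values invented for illustration)
d <- data.frame(volume  = c(500, 420, 630),
                avgtemp = c(66.5, 61.0, 52.0),
                weekday = c(TRUE, TRUE, FALSE))

# The design matrix contains a single dummy column weekdayTRUE;
# weekday = FALSE rows simply get 0 there, i.e. only the intercept applies
model.matrix(volume ~ avgtemp + weekday, data = d)
```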

Task - 2 mins

How should we interpret the value of the coefficient for weekdayTRUE? Have a think about this, and discuss it with the people around you.

Solution

In the following way: on average, there are approximately \(70\) fewer users on the rail trail each weekday as compared with a Saturday or a Sunday (with avgtemp held constant: we’ll discuss this more fully below).

Why is this called a ‘parallel slopes model’? The easiest way to answer this is by visualising the model. The ggPredict() function in the ggiraphExtra package can help with this!

library(ggiraphExtra)
ggPredict(model2, interactive = TRUE)

So we see the model effectively has two lines: one for weekdays, and another for weekend days. The lines are parallel (their slopes are the same), so we call this a ‘parallel slopes model’.

How do these two lines come about? Let’s look again at the regression equation:

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday}\]

The slope of the line with respect to avgtemp is \(b_{\textrm{avgtemp}}\) in both cases; the weekday term can only shift the line up or down, because weekday is categorical (coded \(0\) or \(1\)). So for weekday = TRUE values, the line is effectively

\[\widehat{\textrm{volume}} = (\textrm{intercept} + b_{\textrm{weekday}}) + b_{\textrm{avgtemp}} \times \textrm{avgtemp}\]

i.e. just a shift in the intercept, while for weekday = FALSE values, we get

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp}\]

The lines differ only in intercept; they are parallel.
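We can make the two intercepts concrete using the fitted coefficients copied from the summary(model2) output above:

```r
# Fitted coefficients copied from summary(model2) above
b <- c(intercept = 163.022, avgtemp = 4.541, weekdayTRUE = -70.320)

# Both lines share the slope b["avgtemp"]; only the intercept shifts
c(weekend_intercept = unname(b["intercept"]),                     # weekday = FALSE: 163.022
  weekday_intercept = unname(b["intercept"] + b["weekdayTRUE"]))  # weekday = TRUE:   92.702
```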

Task - 5 mins

Try adding the summer categorical predictor to the existing model with avgtemp and weekday.

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{summer}} \times \textrm{summer}\]

  • How many lines do you expect to see in this model?
  • Is this a parallel slopes model? [Hint try ggPredict() on the model object]
  • Is the addition of this predictor justified [Hint what is the p-value of \(b_{\textrm{summer}}\)]?

Solution

We expect four lines in this case, as each of weekday and summer can take \(2\) values, so \(2 \times 2 = 4\). We still have a parallel slopes model, as summer is also categorical, so the four lines will be parallel (only their intercepts will differ).

Now let’s run the regression and diagnostics:

model3 <- lm(volume ~ avgtemp + weekday + summer, data = RailTrail_trim)
summary(model3)
## 
## Call:
## lm(formula = volume ~ avgtemp + weekday + summer, data = RailTrail_trim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -236.165  -75.448    6.184   68.496  252.680 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   71.376     90.336   0.790 0.431630    
## avgtemp        6.387      1.639   3.898 0.000192 ***
## weekdayTRUE  -66.958     25.505  -2.625 0.010244 *  
## summer       -59.977     41.052  -1.461 0.147662    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111.1 on 86 degrees of freedom
## Multiple R-squared:  0.2659, Adjusted R-squared:  0.2402 
## F-statistic: 10.38 on 3 and 86 DF,  p-value: 6.714e-06
par(mfrow = c(2, 2))
plot(model3)

So, we find only a pretty small improvement in \(r^2\) from \(0.248\) to \(0.266\), and the p-value of the summer coefficient indicates that the predictor is not significant!

Next, let’s plot the model.

ggPredict(model3, interactive = TRUE)

We find four parallel lines, as expected! However, we’ll leave summer out of the model, as the p-value of its coefficient indicates that it is not significant.

Why did summer turn out not to be a significant predictor of volume? Likely because avgtemp and summer are correlated:

RailTrail_trim %>%
  summarise(cor = cor(avgtemp, summer))

so it is possible that we gain little extra information in the model by including summer along with avgtemp. In other words, high average temperatures are likely already a strong enough basis to indicate a summer’s day.


4 Interactions - terms involving two or more predictors

Let’s try to interpret the following plot:

RailTrail_trim %>%
  ggplot(aes(x = avgtemp, y = volume, color = weekday)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

What we see is equivalent to grouping the data by weekday and obtaining best-fit lines of volume versus avgtemp for each group of data independently. It’s clear that the slopes and intercepts of these best-fit lines vary depending on whether weekday is TRUE or FALSE.

How can we add this behaviour into our model? By adding an interaction between avgtemp and weekday.

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{avgtemp:weekday}} \times \textrm{avgtemp} \times \textrm{weekday}\]

The last term here is the interaction. Notice that it involves the product of the two predictors: \(\textrm{avgtemp} \times \textrm{weekday}\).

We do this in R formula notation with the \(:\) operator as follows:

model4 <- lm(volume ~ avgtemp + weekday + avgtemp:weekday, data = RailTrail_trim)

We can read \(:\) as ‘…interacting with…’, so + avgtemp:weekday can be read as ‘add avgtemp interacting with weekday’. Let’s visualise the model

ggPredict(model4, interactive = TRUE)

This looks promising! It’s identical to the graph we plotted earlier!


4.1 Interactions via the \(*\) operator

Alternatively, we could have equivalently specified model4 as

alt_model4 <- lm(volume ~ avgtemp * weekday, data = RailTrail_trim)
ggPredict(alt_model4, interactive = TRUE)

We can read \(*\) as ‘these predictors and all possible interactions between them’. For the moment, though, it’s probably easier to stick to using \(:\) explicitly to include interactions.

Task - 2 mins

Examine the summary of the model including the interaction between avgtemp and weekday. Is our inclusion of the interaction justified?

Solution

summary(model4) # or summary(alt_model4)
## 
## Call:
## lm(formula = volume ~ avgtemp + weekday + avgtemp:weekday, data = RailTrail_trim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -269.906  -78.862    2.064   68.190  291.641 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          285.887    103.390   2.765  0.00696 **
## avgtemp                2.457      1.718   1.431  0.15619   
## weekdayTRUE         -262.193    128.181  -2.045  0.04386 * 
## avgtemp:weekdayTRUE    3.300      2.161   1.527  0.13040   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111 on 86 degrees of freedom
## Multiple R-squared:  0.2675, Adjusted R-squared:  0.2419 
## F-statistic: 10.47 on 3 and 86 DF,  p-value: 6.115e-06
par(mfrow = c(2, 2))
plot(model4)

Alas, including this interaction is not justified: the p-value of its coefficient, at approximately \(0.13\), is greater than our significance level of \(0.05\).

Why does the interaction term lead to different slopes and intercepts?

Why does including the avgtemp:weekday interaction term lead to different slopes and intercepts for weekday = TRUE and weekday = FALSE?

To see this, let’s rearrange the regression equation:

\[\widehat{\textrm{volume}} = (\textrm{intercept} + b_{\textrm{weekday}} \times \textrm{weekday}) + (b_{\textrm{avgtemp}} + b_{\textrm{avgtemp:weekday}} \times \textrm{weekday}) \times \textrm{avgtemp}\]

The first term in parentheses \(()\) is the overall ‘intercept’, and the second term in parentheses is the overall ‘slope’, and we see now that these depend on weekday!

  case                      ‘intercept’                                      ‘slope’
  weekday = FALSE (or 0)    \(\textrm{intercept}\)                           \(b_{\textrm{avgtemp}}\)
  weekday = TRUE (or 1)     \(\textrm{intercept} + b_{\textrm{weekday}}\)    \(b_{\textrm{avgtemp}} + b_{\textrm{avgtemp:weekday}}\)
The two different cases each lead to distinct slopes and intercepts. The other way to think about this is that we now have \(4\) adjustable coefficients in the regression model. A line is completely determined once we decide on a slope and intercept, so the four coefficients give us enough freedom to fit two completely distinct lines.
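Using the fitted coefficients copied from the summary(model4) output above, we can compute each line’s overall intercept and slope directly:

```r
# Fitted coefficients copied from summary(model4) above
b <- c(intercept = 285.887, avgtemp = 2.457,
       weekdayTRUE = -262.193, `avgtemp:weekdayTRUE` = 3.300)

# weekday = FALSE line: intercept 285.887, slope 2.457
c(intercept = unname(b["intercept"]), slope = unname(b["avgtemp"]))

# weekday = TRUE line: both intercept and slope shift
c(intercept = unname(b["intercept"] + b["weekdayTRUE"]),           #  23.694
  slope     = unname(b["avgtemp"] + b["avgtemp:weekdayTRUE"]))     #   5.757
```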



5 Adding another continuous predictor

We’re not having a lot of luck in model development so far (mainly because we are not being systematic), but let’s keep doggedly trying to improve our model! This time, we’ll add another continuous predictor, say cloudcover, which measures the cloud cover observed each day (apparently in ‘oktas’, or eighths of the sky, but the values seem more consistent with tenths of the sky).

RailTrail_trim %>%
  summarise(min = min(cloudcover), max = max(cloudcover))

Let’s first investigate whether cloudcover and volume are associated:

RailTrail_trim %>%
  ggplot(aes(x = cloudcover, y = volume)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

We have a reasonable association, but bear in mind that we are still not being systematic, as we haven’t subtracted off variation due to avgtemp and weekday. Nevertheless, let’s go ahead and add cloudcover to the regression

# add cloudcover before weekday for ggPredict() later
model5 <- lm(volume ~ avgtemp + cloudcover + weekday, data = RailTrail_trim)
summary(model5)
## 
## Call:
## lm(formula = volume ~ avgtemp + cloudcover + weekday, data = RailTrail_trim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -204.803  -58.375    6.594   60.178  277.581 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 200.1022    59.1826   3.381  0.00109 ** 
## avgtemp       5.2461     0.9536   5.502 3.81e-07 ***
## cloudcover  -16.0115     3.3952  -4.716 9.21e-06 ***
## weekdayTRUE -47.9480    23.4061  -2.049  0.04356 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.3 on 86 degrees of freedom
## Multiple R-squared:  0.4022, Adjusted R-squared:  0.3814 
## F-statistic: 19.29 on 3 and 86 DF,  p-value: 1.185e-09
par(mfrow = c(2, 2))
plot(model5)

Yay, progress again! It looks like cloudcover is useful: \(r^2\) increases from \(0.25\) to \(0.40\), and the residual standard error value drops by approximately \(11\) users. The coefficient of cloudcover is statistically significant too.

The regression equation is now

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{cloudcover}} \times \textrm{cloudcover}\]

Unfortunately, we start to run into problems visualising models with two or more continuous predictors due to the number of dimensions involved. This is about as far as we can go, because humans can really only visualise three dimensions, and here we have one response dimension (volume) plus two continuous predictor dimensions (avgtemp and cloudcover), together with a categorical predictor (weekday).

A linear model involving two continuous predictors gives us a plane of best-fit. Here, with two continuous- and one categorical predictor, we go one step further and generate a parallel planes model similar in concept to the parallel slopes model we saw above.

Let’s see a \(2\)-dimensional representation of this model using facets (as generated automatically by ggPredict())

ggPredict(model5, colorn = 10, interactive = TRUE)

To interpret this plot, imagine in each facet that there is a cloudcover axis pointing into the plane of the screen, starting at \(0\) closest to you, and reaching \(10\) some distance behind the screen. The ten lines plotted show where the best fit plane lies at that value of cloudcover. So, in both cases, the plane would slope downwards ‘into’ the screen, i.e. increasing cloudcover leads to decreasing volume of use of the rail trail. We see the parallel planes model clearly, as we have two distinct but parallel planes (one in each facet) corresponding to weekday = TRUE and weekday = FALSE.

This is about as far as we can reasonably go; we’ll stop trying to visualise models beyond this level of complexity.


6 Interpreting multiple regression coefficients

We will now think about how to interpret fitted coefficients in multiple linear regression. Let’s look again at the regression model we used above:

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{cloudcover}} \times \textrm{cloudcover}\]

Now, when R adds weekday into the model, it generates a dummy variable weekdayTRUE which is either \(1\) or \(0\). If it is \(1\) for a row, then the fitted coefficient \(-47.9480\) is added to the estimated response \(\widehat{\textrm{volume}}\) for that row. We would interpret this as ‘on average, weekdays experience volume that is \(48\) users lower than weekend days, for constant values of avgtemp and cloudcover’.

So much for categorical predictors, but what about the two continuous predictors avgtemp and cloudcover? How do we interpret their fitted coefficients? In the following way:


The fitted coefficient of a continuous predictor in a multiple linear regression model equals the change in response variable for a \(1\) unit increase in that predictor with all other predictor variables held constant.


By ‘constant’ here we mean ‘fixed’. So, for a specific example, the fitted coefficient \(b_{\textrm{avgtemp}} = 5.2461\) can be interpreted as follows: ‘a \(1\) degree Fahrenheit increase in average temperature raises the rail trail user volume by \(5.25\) users on average, with all other predictors held constant’.
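We can check this interpretation numerically using the coefficients copied from the summary(model5) output above: raising avgtemp by one degree with the other predictors fixed changes the prediction by exactly \(b_{\textrm{avgtemp}}\).

```r
# Coefficients copied from summary(model5) above
b <- c(intercept = 200.1022, avgtemp = 5.2461,
       cloudcover = -16.0115, weekdayTRUE = -47.9480)

# The fitted regression equation, written out by hand
vol_hat <- function(avgtemp, cloudcover, weekday) {
  unname(b["intercept"] + b["avgtemp"] * avgtemp +
         b["cloudcover"] * cloudcover + b["weekdayTRUE"] * weekday)
}

# A 1 degree Fahrenheit increase in avgtemp, other predictors held constant
vol_hat(61, 6, TRUE) - vol_hat(60, 6, TRUE)  # equals b_avgtemp = 5.2461
```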


Task - 5 mins

Add precip (daily precipitation in inches) into the regression model. Perform diagnostics, and if you find that precip is a significant predictor, interpret its fitted coefficient.

Solution

model6 <- lm(volume ~ avgtemp + cloudcover + weekday + precip, data = RailTrail_trim)
summary(model6)
## 
## Call:
## lm(formula = volume ~ avgtemp + cloudcover + weekday + precip, 
##     data = RailTrail_trim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -218.881  -58.924    3.428   57.285  266.250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  155.0024    59.4574   2.607 0.010787 *  
## avgtemp        5.8743     0.9485   6.193 2.02e-08 ***
## cloudcover   -12.8294     3.4783  -3.688 0.000397 ***
## weekdayTRUE  -45.8821    22.5943  -2.031 0.045412 *  
## precip      -117.5681    43.2329  -2.719 0.007928 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.72 on 85 degrees of freedom
## Multiple R-squared:  0.4501, Adjusted R-squared:  0.4242 
## F-statistic: 17.39 on 4 and 85 DF,  p-value: 1.849e-10
par(mfrow = c(2, 2))
plot(model6)

Looks promising! All the coefficients are statistically significant at a significance level of \(0.05\). We interpret the coefficient of precip as follows: ‘a \(1\) inch increase in daily precipitation lowers rail trail volume by approximately \(118\) users on average, with all other predictors held constant’.



7 Interaction of continuous predictors

Imagine we come up with the idea that there should be an interaction between avgtemp and precip, i.e. we think something along the lines that “days that are both hot and wet should lead to particularly low usage of the rail trail”. Frankly, that does sound like an unpleasant environment in which to run or cycle!

We can straightforwardly add this interaction in R formula notation as avgtemp:precip.

model7 <- lm(volume ~ avgtemp + cloudcover + weekday + precip + avgtemp:precip, data = RailTrail_trim)
summary(model7)
## 
## Call:
## lm(formula = volume ~ avgtemp + cloudcover + weekday + precip + 
##     avgtemp:precip, data = RailTrail_trim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -218.775  -59.164    3.183   57.361  266.189 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     153.9280    63.9007   2.409 0.018189 *  
## avgtemp           5.8964     1.0598   5.564 3.08e-07 ***
## cloudcover      -12.8866     3.6984  -3.484 0.000786 ***
## weekdayTRUE     -45.8527    22.7365  -2.017 0.046921 *  
## precip         -100.8213   353.3547  -0.285 0.776097    
## avgtemp:precip   -0.2335     4.8893  -0.048 0.962023    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.29 on 84 degrees of freedom
## Multiple R-squared:  0.4501, Adjusted R-squared:  0.4173 
## F-statistic: 13.75 on 5 and 84 DF,  p-value: 8.34e-10
par(mfrow = c(2, 2))
plot(model7)

We see, however, that inclusion of the interaction is not justified: its fitted coefficient is not statistically significant, and it also renders precip an insignificant predictor. Not good.

How could we have investigated this interaction before we added it to the model? The conditional plot is helpful! We can obtain a conditional plot in R using the coplot() function as follows:

coplot(volume ~ avgtemp | precip,
       data = RailTrail_trim)

We tell the function to plot volume against avgtemp, conditional on the value of precip using the vertical bar | to mean ‘conditional upon’, as in the earlier lessons on probability theory.

Think of this plot as showing


volume versus avgtemp given precip


You may also hear this called a ‘shingles’ or ‘roof tile’ plot, because the top graph resembles overlapping tiles on a roof (AKA shingles in some dialects). If we like, we can add a best fit line to each panel using the panel argument (this takes a function telling coplot() what to do in each panel). We can also control the degree of overlap of the shingles with the overlap argument.

coplot(volume ~ avgtemp | precip,
       panel = function(x, y, ...){
         points(x, y)
         abline(lm(y ~ x), col = "blue")
       },
       overlap = 0.2,
       data = RailTrail_trim)

How do we interpret the results of coplot()? First, realise that the three scatter plots in the bottom part of the graph correspond to the three shingles in the top plot in the following order:

  • the bottom left scatter plot corresponds to the bottom shingle
  • the bottom right scatter plot corresponds to the middle shingle
  • the top left scatter plot corresponds to the top shingle

Coplots always go this way: reading the scatter plots left to right along each row, and then from the bottom row upwards, they correspond to the shingles in order from bottom to top. Each scatter plot shows \(x\) and \(y\) data for the \(z\) values in the corresponding shingle. So, for the plot above, we have

  • the bottom left scatter plot corresponds to volume versus avgtemp for precip values in a narrow range from \(0\) inches to approximately \(0.01\) inches (i.e. essentially the days where it did not rain).
  • the bottom right scatter plot corresponds to volume versus avgtemp for precip values in a broader range from \(0\) inches to approximately \(0.16\) inches.
  • the top left scatter plot corresponds to volume versus avgtemp for precip values in a very broad range from approximately \(0.14\) inches to approximately \(1.5\) inches.

Why have the shingles ended up being the width that they are? R tries to arrange the shingles so that, as far as possible, they contain near equal numbers of data points.
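The shingle ranges are computed by the co.intervals() function (which coplot() uses internally), and we can call it directly to see the near-equal-count intervals it chooses. A sketch on simulated precipitation-like values (the data here are invented for illustration):

```r
# Simulated precipitation-like values (invented): many dry days plus some rain
set.seed(1)
z <- c(rep(0, 40), round(runif(50, 0, 1.5), 2))

# Each row is one shingle: (lower, upper) endpoints chosen so that the
# shingles contain roughly equal numbers of points, with 20% overlap
co.intervals(z, number = 3, overlap = 0.2)
```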


Now, here’s the key point. If we examine the scatter points and find that the slope of the best fit line varies systematically as the shingle range varies, this indicates a potentially significant interaction between the \(x\) predictor and the conditioning predictor!

Here’s a plot showing schematically what a coplot() for a significant interaction might look like

**Fig. 1** Coplot showing a clear interaction between x and conditioning variable z. The lines’ slopes vary systematically from positive to negative as the z-range varies.


Alas, we can see no clear systematic change in the slopes of the best fit lines in our coplot(volume ~ avgtemp | precip,...), indicating that there is likely not a significant interaction between avgtemp and precip, as the regression diagnostics above confirmed for us.


8 Recap

  • How do we add additional predictors to a linear regression model?
    Answer Using the + operator in formula notation.


  • Under what circumstances would we expect multiple linear regression to yield planes of best fit?
    Answer We expect planes of best fit whenever we have two or more continuous predictors in a model.


  • What is an interaction term in multiple regression? What operator do we use to add an interaction?
    Answer An interaction term involves a product of two or more predictors. We can add an interaction using the : operator.


  • What is the general principle for interpreting a regression coefficient in multiple linear regression?
    Answer The fitted coefficient of a predictor in a multiple linear regression model equals the change in response variable for a \(1\) unit increase in that predictor (for a continuous predictor) or the difference in response relative to the reference level (for a categorical predictor) with all other predictor variables held constant.


  • What type of plot can we use to investigate an interaction between two continuous predictors? What R function do we use to obtain such a plot?
    Answer We can use a conditional plot to investigate an interaction between two continuous predictors. R provides the coplot() function for this purpose.

9 Additional resources